Abstract

We describe the CoNLL-2001 shared task: di- 
viding text into clauses. We give background 
information on the data sets, present a general 
overview of the systems that have taken part in 
the shared task and briefly discuss their perfor- 
mance. 

1 Introduction 

The CoNLL-2001 shared task aims at discov- 
ering clause boundaries with machine learning 
methods. Why clauses? Clauses are structures 
used in applications such as Text-to-Speech
conversion (Ejerhed, 1988), text-alignment
(Papageorgiou, 1997) and machine translation
(Leffa, 1998). Ejerhed (1988) described clauses
as a natural structure above chunks:

It is a hypothesis of the author's current
clause-by-clause processing theory, that a
unit corresponding to the basic clause is a
stable and easily recognizable surface unit
and that it is also an important partial
result and building block in the construction
of a richer linguistic representation that
encompasses syntax as well as semantics and
discourse structure (Ejerhed, 1988, page 220)



The goal of this shared task is to evaluate 
automatic methods, especially machine learn- 
ing methods, for finding clause boundaries in 
text. We have selected a training and test cor- 
pus for performing this evaluation. The task has 
been divided in three parts in order to allow ba- 
sic machine learning methods to participate in 
this task by processing the data in a bottom-up 
fashion. 

2 Task Description



Defining clause boundaries is not trivial (Leffa,
1998). In this task, the gold standard clause
segmentation is provided by the Penn Treebank
(Marcus et al., 1993). The guidelines of the
Penn Treebank describe in detail how sentences
are segmented into clauses (Bies et al., 1995).
Here is an example of a sentence and its clauses
obtained from Wall Street Journal section 15 of
the Penn Treebank (Marcus et al., 1993):



(S Coach them in
   (S-NOM handling complaints)
   (SBAR-PRP so that
      (S they can resolve problems immediately)
   )
   .
)



The clauses of this sentence have been enclosed 
between brackets. A tag next to the open 
bracket denotes the type of the clause. 

In the CoNLL-2001 shared task, the goal is to 
identify clauses in text. Since clauses can be em- 
bedded in each other, this task is considerably 
more difficult than last year's task, recognizing 
non-embedded text chunks. For that reason, 
we have disregarded type and function informa- 
tion of the clauses: every clause has been tagged 
with S rather than with an elaborate tag such as 
SBAR-PRP. Furthermore, the shared task has 
been divided in three parts: identifying clause 
starts, recognizing clause ends and finding com- 
plete clauses. The results obtained for the first 
two parts can be used in the third part of the 
task. 
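
The three parts can be illustrated on the example sentence above. The following is a minimal sketch, not the conversion script used for the shared task data: it assumes the bracketing is written out in full (the closing brackets and the sentence-final period are assumptions), and the names `parse_clauses` and `encode` are hypothetical. It derives clause spans from the brackets and then produces, for each token, a start tag (part 1), an end tag (part 2) and a bracket tag (part 3), using the S/E/X and (S* ... *S) notation of the data section.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (index of first token, index of last token)


def parse_clauses(bracketed: str) -> Tuple[List[str], List[Span]]:
    """Split a bracketed sentence into tokens and clause spans."""
    tokens: List[str] = []
    spans: List[Span] = []
    open_starts: List[int] = []   # start positions of clauses still open
    expect_label = False          # True right after an opening bracket
    for piece in bracketed.replace("(", " ( ").replace(")", " ) ").split():
        if piece == "(":
            open_starts.append(len(tokens))
            expect_label = True
        elif piece == ")":
            spans.append((open_starts.pop(), len(tokens) - 1))
        elif expect_label:
            expect_label = False  # drop the label: every clause counts as S
        else:
            tokens.append(piece)
    return tokens, sorted(spans)


def encode(n_tokens: int, spans: List[Span]) -> List[Tuple[str, str, str]]:
    """Return (part 1, part 2, part 3) tags for every token position."""
    rows = []
    for i in range(n_tokens):
        n_starts = sum(1 for first, _ in spans if first == i)
        n_ends = sum(1 for _, last in spans if last == i)
        part1 = "S" if n_starts else "X"
        part2 = "E" if n_ends else "X"
        part3 = "(S" * n_starts + "*" + "S)" * n_ends
        rows.append((part1, part2, part3))
    return rows


# Example sentence; closing brackets and final period are assumed.
sentence = ("(S Coach them in (S-NOM handling complaints) "
            "(SBAR-PRP so that (S they can resolve problems immediately)) .)")
tokens, spans = parse_clauses(sentence)
for token, tags in zip(tokens, encode(len(tokens), spans)):
    print(f"{token:<12}{tags[0]:<3}{tags[1]:<3}{tags[2]}")
```

Under these assumptions the word "immediately" closes two clauses at once, which is why part 3 is harder than parts 1 and 2: a start or end tag only records that some clause boundary is present, while the bracket tag must also get the number of boundaries right.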

3 Data and Evaluation 

This CoNLL shared task works with roughly
the same sections of the Penn Treebank as the
widely used data set for base noun phrase
recognition (Ramshaw and Marcus, 1995): WSJ
sections 15-18 of the Penn Treebank as training
material, section 20 as development material for
tuning the parameters of the learner and
section 21 as test data.^1 The data sets contain
tokens (words and punctuation marks),
information about the location of sentence
boundaries and information about clause boundaries.
Additionally, a part-of-speech (POS) tag and
a chunk tag were assigned to each token by a
standard POS tagger (Brill, 1994) and a
chunking program (Tjong Kim Sang, 2000). We used
these POS and chunking tags rather than the
Treebank ones in order to make sure that the
performance rates obtained for this data are
realistic estimates for data for which no Treebank
tags are available. In the clause segmentation
we have only included clauses in the Treebank
which had a label starting with S, thus
disregarding clauses with label RRC or FRAG. All
clause labels have been converted to S.

^1 These clause data sets are available at
http://lcg-www.uia.ac.be/conll2001/clauses/

Different schemes for encoding phrase
information have been used in the data:

• B-X, I-X and O have been used for marking
the first word in a chunk of type X, a
non-initial word in an X chunk and a word
outside of any chunk, respectively (see also
Tjong Kim Sang and Buchholz (2000)).

• S, E and X mark a clause start, a clause end
and neither a clause start nor a clause end,
respectively. These tags have been used in
the first and second part of the shared task.

• (S*, *S) and * denote a clause start, a
clause end and neither a clause start nor a
clause end, respectively. The first two can
be used in combination with each other.
For example, (S*S) marks a word where a
clause starts and ends, and *S)S) marks a
word where two clauses end. These tags are
used in the third part of the shared task.

The first two phrase encodings were inspired by
the representation used by Ramshaw and Marcus
(1995). Here is an example of the clause
encoding schemes:

   Coach        S  X  (S*
   them         X  X  *
   in           X  X  *
   handling     S  X  (S*
   complaints   X  E  *S)
   so           S  X  (S*
   that         X  X  *
   they         S  X  (S*
   can          X  X  *
   resolve      X  X  *
   problems     X  X  *
   immediately  X  E  *S)S)
   .            X  E  *S)

Three tags can be found next to each word,
respectively denoting the information for the first,
second and third part of the shared task. The
goal of this task is to predict the test data
segmentation as well as possible with a model built
from the training data.

The performance in this task is measured
with three rates. First, the percentage of
detected starts, ends or clauses that are correct
(precision). Second, the percentage of starts,
ends or clauses in the data that were found
by the learner (recall). And third, the Fβ=1
rate, which is equal to (β²+1)*precision*recall
/ (β²*precision+recall) with β=1 (van Rijsbergen,
1975). The latter rate has been used as the
target for optimization. 
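
The three rates can be computed as in the following minimal sketch. It is not the official evaluation script: the function names `f_beta` and `evaluate` are hypothetical, and it assumes that gold and predicted items (starts, ends or clauses) are represented as sets of hashable positions, so that an item is correct only if it matches exactly.

```python
from typing import Hashable, Set, Tuple


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F rate: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


def evaluate(gold: Set[Hashable], predicted: Set[Hashable],
             beta: float = 1.0) -> Tuple[float, float, float]:
    """Return (precision, recall, F) as percentages."""
    correct = len(gold & predicted)
    precision = 100.0 * correct / len(predicted) if predicted else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall, f_beta(precision, recall, beta)


# Clause spans as (first token, last token); one predicted span is wrong.
gold = {(0, 12), (3, 4), (5, 11), (7, 11)}
predicted = {(0, 12), (3, 4), (5, 10)}
print(evaluate(gold, predicted))

# Consistency check against reported figures: precision 84.82% and
# recall 73.28% give the F rate 78.63 listed for test part 3.
print(round(f_beta(84.82, 73.28), 2))
```

With β=1 the rate is the harmonic mean of precision and recall, so a system cannot reach a high score by doing well on only one of the two.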

4 Results 

Six systems have participated in the shared
task. Two of them used boosting and the
others used techniques which were connectionist,
memory-based, statistical and symbolic.
Patrick and Goyal (2001) applied the AdaBoost
algorithm for boosting the performance of
decision graphs. The latter are an extension of
decision trees: they allow tree nodes to have more
than one parent. The boosting algorithm
improves the performance of the decision graphs
by assigning weights to the training data items
based on how accurately they have been
classified. Hammerton (2001) used a recurrent
neural network architecture, long short-term
memory, for predicting embedded clause
structures. The network processes sentences word
by word. Memory cells in its hidden layer
enable it to remember states with information
about the current clause. 
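
The reweighting of training items mentioned above for the boosting systems can be sketched as follows. This is the textbook update of discrete AdaBoost, given here as an assumption: the participating papers may use a different variant, and the function name `reweight` is hypothetical. Items that the current classifier gets wrong receive more weight, so the next classifier concentrates on them.

```python
import math
from typing import List


def reweight(weights: List[float], correct: List[bool]) -> List[float]:
    """One discrete AdaBoost round: return the renormalized item weights.

    `weights` must sum to 1; `correct[i]` tells whether the current
    weak classifier labeled item i correctly.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    # Confidence of the weak classifier: large when its error is small.
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated]


# Four equally weighted items, the last one misclassified.
print(reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False]))
```

After the update the misclassified items together hold half of the total weight, which is a known property of this update rule.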

test part 1        precision   recall    Fβ=1
Carreras & Màr.     93.96%     89.59%    91.72
Tjong Kim Sang      92.91%     85.08%    88.82
Molina & Pla        89.54%     86.01%    87.74
Déjean              93.76%     81.90%    87.43
Patrick & Goyal     89.79%     84.88%    87.27
baseline            98.44%     36.58%    53.34

Table 1: The performance of five systems while
processing the development data and the test
data for part 1 of the shared task: finding clause
starts. The baseline results have been obtained
by a system that assumes that every sentence
consists of one clause which contains the
complete sentence.

test part 2        precision   recall    Fβ=1
Carreras & Màr.     90.04%     88.41%    89.22
Tjong Kim Sang      84.72%     79.96%
Patrick & Goyal     80.11%     83.47%    81.76
Molina & Pla        79.57%     77.68%    78.61
Déjean              99.28%     48.90%    65.47
baseline            98.44%     48.90%    65.34

Table 2: The performance of five systems while
processing the development data and the test
data for part 2 of the shared task: identifying
clause ends. The baseline results have been
obtained by a system that assumes that every
sentence consists of one clause which contains the
complete sentence.



Déjean (2001) predicted clause boundaries
with his symbolic learner ALLiS (Architecture
for Learning Linguistic Structure). It is based
on theory refinement, which means that it
adapts grammars. The learner selects a set
of rules based on their prediction accuracy of
classes in a training corpus. Tjong Kim Sang
(2001) evaluated a memory-based learner while
using different combinations of features
describing items which needed to be classified. His
learner was well suited for identifying clause
starts and clause ends but less suited for
predicting complete clauses. Therefore he used
heuristic rules for converting the part one and
two results of the shared task to results for the
third part.



Molina and Pla (2001) have applied a
specialized Hidden Markov Model (HMM) to the
shared task. They interpreted the three parts of
the shared task as tagging problems and made
the HMM find the most probable sequence of
tags given an input sequence. In the third part
of the task they limited the number of possible
output tags and used rules for fixing bracketing
problems. Carreras and Màrquez (2001)
converted the clausing task to a set of binary
decisions which they modeled with decision trees
combined by AdaBoost. The system
uses features which in some cases contain
relevant information about a complete sentence. It
produces a list of clauses from which the ones
with the highest confidence scores will be
presented as output.

^2 Performances on lines with a * suffix are different
from those in the paper version of the CoNLL-2001
proceedings.

We have derived baseline scores for the differ- 
ent parts of the shared task by evaluating a sys- 
tem that assigns one clause to every sentence. 
Each of these clauses completely covers a sen- 
tence. All participating systems perform above 
the baselines. 
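
The baseline can be sketched as follows. This is an illustration under stated assumptions rather than the organizers' code: the function names `precision_recall` and `baseline_scores` are hypothetical, each sentence is given as its token count plus a non-empty overall set of gold clause spans, and every position is counted once even when several clauses start or end there.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (first token, last token)


def precision_recall(gold: Set, predicted: Set) -> Tuple[float, float]:
    """Precision and recall in percent; both sets must be non-empty."""
    correct = len(gold & predicted)
    return 100.0 * correct / len(predicted), 100.0 * correct / len(gold)


def baseline_scores(
    sentences: List[Tuple[int, Set[Span]]]
) -> Dict[str, Tuple[float, float]]:
    """Score a system that predicts one clause covering each sentence."""
    gold = {"starts": set(), "ends": set(), "clauses": set()}
    pred = {"starts": set(), "ends": set(), "clauses": set()}
    for sent_id, (length, spans) in enumerate(sentences):
        for first, last in spans:
            gold["starts"].add((sent_id, first))
            gold["ends"].add((sent_id, last))
            gold["clauses"].add((sent_id, first, last))
        # Baseline: a single clause from the first to the last token.
        pred["starts"].add((sent_id, 0))
        pred["ends"].add((sent_id, length - 1))
        pred["clauses"].add((sent_id, 0, length - 1))
    return {part: precision_recall(gold[part], pred[part]) for part in gold}


# One 13-token sentence with four clauses and one 2-token fragment
# (for example a bare noun phrase) that contains no clause at all.
data = [(13, {(0, 12), (3, 4), (5, 11), (7, 11)}), (2, set())]
for part, (p, r) in baseline_scores(data).items():
    print(f"{part:<8} precision {p:6.2f}%  recall {r:6.2f}%")
```

The sketch shows why such a baseline has high precision but low recall: its few predictions are wrong only for fragments without a clause, while all embedded clauses are missed.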

In the development data for part 1 of the
shared task, all five participating systems
(Hammerton's only did part 3 of the shared
task) predicted a clause start 30 times at a
position where there was none. About half of these
were in front of the word to. The situation in
which all five systems missed a clause start
occurred 205 times at positions with different
succeeding words. It seems that many of these
errors were caused by a missing comma
immediately before the clause start.

development part 3  precision   recall    Fβ=1
Carreras & Màr.      87.18%     82.48%    84.77
Patrick & Goyal      78.19%     67.63%    72.53
Molina & Pla         70.98%     72.31%    71.64
Tjong Kim Sang       76.54%     67.20%    71.57
Déjean               73.93%     62.44%    67.70
Hammerton            59.85%     55.56%    57.62
baseline             96.32%     35.77%    52.17

test part 3         precision   recall    Fβ=1
Carreras & Màr.      84.82%     73.28%    78.63
Molina & Pla         70.89%     65.57%    68.12
Tjong Kim Sang       76.91%     60.61%    67.79
Patrick & Goyal      73.75%     60.00%    66.17
Déjean               72.56%     54.55%    62.77
Hammerton            55.81%     45.99%    50.42
baseline             98.44%     31.48%    47.71

Table 3: The performance of the six systems
while processing the development data and the
test data for part 3 of the shared task:
recognizing complete clauses. The baseline results
have been obtained by a system that assumes
that every sentence consists of one clause which
contains the complete sentence.

In three cases, the five systems unanimously 
found an end of a clause where there was none 
in the development data of part 2 of the shared 
task. All these occurred at the end of 'sentences' 
which consisted of a single noun phrase or a 
single adverbial phrase. In 205 cases all five 
systems missed a clause end. These errors often 
occurred right before punctuation signs. 

It is hard to make a similar overview for
part 3 of the shared task. Therefore we have
only looked at the accuracies of two clause tags:
(S(S* (starting two clauses) and *S)S) (ending
two clauses). Never did more than three of the
six systems correctly predict the start of two
clauses. The best performing system for this
clause tag was the one of Carreras and Màrquez
with about 52% recall. Three of the systems did
not find any of the double clause starts and
the average recall score of the six was 21%. The
end of two clauses was correctly predicted by all
six systems about 0.5% of the times it occurred.
Again, the system of Carreras and Màrquez was
best with 63% recall while the average system
found 33%.

The six result tables show that the system of
Carreras and Màrquez clearly outperforms the
other five systems on all parts of the shared
task. They were the only ones to use input
features that contained information about a
complete sentence and it seems that this was a good
choice.

5 Related Work 

There have been some earlier studies in
identifying clauses. Abney (1990) used a clause filter
as a part of his CASS parser. It consists of
two parts: one for recognizing basic clauses and
one for repairing difficult cases (clauses
without subjects and clauses with additional VPs).
Ejerhed (1996) showed that a parser can benefit
from automatically identified clause boundaries
in discourse. Papageorgiou (1997) used a set of
hand-crafted rules for identifying clause
boundaries in one text. Leffa (1998) wrote a set of
clause identification rules and applied them to
a small corpus. The performance was very good,
with recall rates above 90%. Orăsan (2000) used
a memory-based learner with post-processing
rules for predicting clause boundaries in the
Susanne corpus. His system obtained F rates of
about 85 for this particular task.

6 Concluding Remarks 

We have presented the CoNLL-2001 shared
task: clause identification. The task was split
into three parts: recognizing clause starts,
finding clause ends and identifying complete,
possibly embedded, clauses. Six systems have
participated in this shared task. They used
various machine learning techniques: boosting,
connectionist methods, decision trees,
memory-based learning, statistical techniques and
symbolic methods. On all three parts of the shared
task the boosted decision tree system of
Carreras and Màrquez (2001) performed best. It
obtained an Fβ=1 rate of 78.63 for the third part
of the shared task.

Acknowledgements 

We would like to thank SIGNLL for giving us
the opportunity to organize this shared task and
our colleagues of the Seminar für
Sprachwissenschaft in Tübingen, CNTS - Language
Technology Group in Antwerp, and the ILK group in
Tilburg for valuable discussions and comments.
This research has been funded by the European
TMR network Learning Computational
Grammars.^3



References 

Steven Abney. 1990. Rapid Incremental Parsing 
with Repair. In Proceedings of the 8th New OED 
Conference: Electronic Text Research. University 
of Waterloo, Ontario. 

Ann Bies, Mark Ferguson, Karen Katz, and Robert
MacIntyre. 1995. Bracketing Guidelines for
Treebank II Style Penn Treebank Project. Technical
report, University of Pennsylvania.

Eric Brill. 1994. Some advances in rule-based
part of speech tagging. In Proceedings of the
Twelfth National Conference on Artificial
Intelligence (AAAI-94). Seattle, Washington.

Xavier Carreras and Lluís Màrquez. 2001.
Boosting Trees for Clause Splitting. In Proceedings of
CoNLL-2001. Toulouse, France.

Hervé Déjean. 2001. Using ALLiS for Clausing. In
Proceedings of CoNLL-2001. Toulouse, France.

Eva Ejerhed. 1988. Finding clauses in unrestricted
text by finitary and stochastic methods. In
Proceedings of the Second Conference on Applied
Natural Language Processing, pages 219-227.

Eva Ejerhed. 1996. Finite state segmentation of
discourse into clauses. In Proceedings of the ECAI
'96 Workshop on Extended finite state models of
language. ECAI '96, Budapest, Hungary.

James Hammerton. 2001. Clause identification with 
Long Short-Term Memory. In Proceedings of 
CoNLL-2001. Toulouse, France. 

Vilson J. Leffa. 1998. Clause Processing in Complex 
Sentences. In Proceedings of LREC'98. Granada, 
Spain. 

Mitchell P. Marcus, Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1993. Building a large 
annotated corpus of English: the Penn Treebank. 
Computational Linguistics, 19(2). 

Antonio Molina and Ferran Pla. 2001. Clause De- 
tection using HMM. In Proceedings of CoNLL- 
2001. Toulouse, France. 

Constantin Orăsan. 2000. A hybrid method for
clause splitting in unrestricted English texts. In
Proceedings of ACIDCA'2000. Monastir, Tunisia.

H. V. Papageorgiou. 1997. Clause recognition in
the framework of alignment. In R. Mitkov and
N. Nicolov, editors, Recent Advances in Natural
Language Processing. John Benjamins Publishing
Company, Amsterdam/Philadelphia.

Jon D. Patrick and Ishaan Goyal. 2001. Boosted 
Decision Graphs for NLP Learning Tasks. In Pro- 
ceedings of CoNLL-2001. Toulouse, France. 

Lance A. Ramshaw and Mitchell P. Marcus.
1995. Text Chunking Using Transformation-Based
Learning. In Proceedings of the Third ACL
Workshop on Very Large Corpora. Cambridge,
MA, USA.

Erik F. Tjong Kim Sang. 2001. Memory-Based 
Clause Identification. In Proceedings of CoNLL- 
2001. Toulouse, France. 

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000.
Introduction to the CoNLL-2000 Shared Task:
Chunking. In Proceedings of CoNLL-2000 and
LLL-2000. Lisbon, Portugal.

Erik F. Tjong Kim Sang. 2000. Text Chunking by 
System Combination. In Proceedings of CoNLL- 
2000 and LLL-2000. Lisbon, Portugal. 

C.J. van Rijsbergen. 1975. Information Retrieval.
Butterworths.



^3 http://lcg-www.uia.ac.be/



